We consider straight-line programs (SLPs), since all algorithms on SLP-generated strings can be applied to
processing LZ-compressed texts.

The main result is a new algorithm for pattern matching when both a text T and a pattern P are presented
by SLPs (the so-called fully compressed pattern matching problem). We show how to find the first occurrence,
count all occurrences, and check whether any given position is an occurrence in time O(n^2 m). Here m and n
are the sizes of the straight-line programs generating P and T, respectively.

We then present polynomial algorithms for computing a fingerprint table and a compressed representation of all
covers (for the first time) and for finding periods of a given compressed string (our algorithm is faster than
previously known ones). On the other hand, we show that computing the Hamming distance between two
SLP-generated strings is NP- and coNP-hard.

I. Introduction



Background. How can we solve text problems if, instead of an input string, we get only a program generating
it? Is it possible to find a faster solution than just "generate the text + apply a classical algorithm"? In this
paper we consider strings generated by straight-line programs (SLPs). These are programs using only the
assignment operator. The exact definition and a discussion of this notion are given in Section II.

We come to this question from studying algorithms on compressed texts. Many algorithms for direct search
in compressed texts without unpacking were proposed in the last decade. Finally (see Rytter's work
[22]), it turns out that only the decompression stage really matters. More precisely, we can consider a pair
(decompression algorithm, archive file) as a generating description of the original text. Rytter presents
an efficient translation from any LZ-compressed text to a straight-line program generating the same
text.

Here we study the complexity of solving problems as a function of the size of a program that generates the
input. Since the ratio between the length of the original string and the size of the SLP might be exponential,
even a polynomial algorithm is of interest. The main purpose of such algorithms is to speed up the naive
approach and to save memory (since we never generate the original text). Moreover, even if we have just
the original text data, but it contains a lot of repetitions, we can first find a generating description and then
apply one of our algorithms for SLP-generated texts. A similar idea (for OBDD-described functions) is
used in symbolic model checking.

There are already promising applications of working with compressed objects: solving word equations in
polynomial space [20] and pattern matching in message sequence charts [9]. Potential applications include
all fields with highly compressible data, such as bioinformatics (genomes contain a lot of repetitions), pattern
matching in statistical data (such as internet log files), and any kind of automatically generated text.

Problem. For any text problem on SLP-generated strings we ask two questions: (1) Does a polynomial
algorithm exist? (2) If yes, what is the exact complexity of the problem? We can regard a negative answer
to the first question (say, NP-hardness) as evidence that the naive "generate-and-solve" approach is the best
way to solve that particular problem.

First of all, we consider the complexity of the pattern matching problem on SLP-generated strings (sometimes
called fully compressed pattern matching, or FCPM). That is, given SLPs generating a pattern P and a text
T, decide whether P is a substring of T and provide a succinct description of all occurrences. An
important special case is the equivalence problem for SLP-generated texts. We then extend the obtained
algorithm and technique to several other string problems: finding the shortest period/cover and constructing
a fingerprint table. Finally, we consider the Hamming distance problem.

Results. The key result of the paper is a new O(n^2 m) algorithm for pattern matching on SLP-generated
texts, where m and n are the sizes of the SLPs generating P and T, respectively. This is an improvement
on the O((n + m)^5 log^3 |T|) algorithm by Gąsieniec et al. [8] and the O(n^2 m^2) algorithm by Miyazaki et al.
[16]. For one quite special class of SLPs the FCPM problem was solved in time O(mn) [11].

After presenting the main algorithm in Section III, we sketch (more details will follow in a full version of
the paper) polynomial algorithms for finding covers and periods, and for computing a fingerprint table for
SLP-generated texts, in Section IV. On the other hand, we show that, surprisingly, computing the Hamming
distance is NP-hard. Thus two closely related problems (Hamming distance and equivalence) lie on
different sides of the border between efficiently solvable problems on SLP-generated texts and intractable
ones.

Proofs. The main ingredients of the algorithm are dynamic programming, operations on
arithmetical progressions, and special tricks for the case of a dense sequence of pattern occurrences. The
following way of working with SLP-generated strings turns out to be the most productive: (1) invent some
auxiliary property of strings, (2) compute it for all intermediate texts, (3) derive the answer from the computed
array. In our pattern matching algorithm we consecutively compute the elements of a special n x m table.
This is exactly the same idea as in [16]. However, the routine for computing a new element is completely
different: we use only O(n) time, while O(mn) time is used in [16]. One of the key tricks
comes from checking for false matches in the Rabin-Karp algorithm, as in [17]. However, we
apply only the same intuition, not the same procedure directly.

An immediate consequence of our result is an O(n^2 m) algorithm for pattern matching in LZ-compressed
texts. Our algorithm uses only linear time for one step of dynamic programming. Hence, in order to get
a faster method, a radically new approach is needed.



A. Related results 

The whole field started with the papers by Amir, Benson and Farach [2] and by Farach and Thorup [6],
which present algorithms for compressed pattern matching whose running time depends on the size of the
compressed text. Compression models range from run-length encoding and different members of the Lempel-Ziv
family [24] to straight-line programs (SLPs) and collage systems [13]. The last two are good theoretical models
that generalize all previously studied compression algorithms. Namely, Rytter [22] proved that any LZ-encoding
can be efficiently translated to an SLP of approximately the same size without unpacking.

In addition to pattern matching, algorithms for window subsequence (i.e., scattered substring)
search [5], membership in a regular language [18], and approximate pattern matching [12] were presented.
On the other hand, it was shown that the fully compressed subsequence problem and the longest common
subsequence problem are Θ_2-hard (Θ_2 is a kind of closure of NP and coNP), compressed and fully compressed
two-dimensional pattern matching are NP-complete and Σ_2-complete, respectively [4], while membership
in a context-free language is even PSPACE-complete [15].

Encouraging experimental results are reported in [19]. The papers [21], [23] survey the field of 
processing compressed texts, while the book [10] provides a general overview of classical pattern matching 
algorithms. 



II. Compressed Strings Are Straight-Line Programs 

A straight-line program is a context-free grammar generating exactly one string. Moreover, we allow
only two types of productions: X_i -> a and X_i -> X_p X_q with i > p, q. The string presented by a given
SLP is the unique text corresponding to the last nonterminal X_m. Although in previous papers X_i denotes
only a nonterminal symbol, while the corresponding text is denoted by val(X_i) or eval(X_i), we
identify these notions and use X_i both for the nonterminal symbol and for the corresponding text. We say that
the size of an SLP is equal to the number of productions.

Example. Consider the string abaababaabaab. It can be generated by the following SLP:

X_7 -> X_6 X_5,  X_6 -> X_5 X_4,  X_5 -> X_4 X_3,  X_4 -> X_3 X_2,  X_3 -> X_2 X_1,  X_2 -> a,  X_1 -> b.



In fact, the notion of an SLP describes only the decompression operation. We do not care how such an SLP
was obtained. Surprisingly, while the compression methods vary among the many practical algorithms of the
Lempel-Ziv family and run-length encoding, the decompression goes in almost the same way. In 2003 Rytter [22]
showed that, given an LZ-encoding of a string T, we can efficiently get an SLP encoding of the same
string which is at most O(log |T|) times longer than the original LZ-encoding. This translation allows us to
construct algorithms only in the simplest SLP model. If we get a different encoding, we just translate it
to an SLP before applying our algorithm. Moreover, if we apply Rytter's translation to the LZ77-encoding of a
string T, then we get an O(log |T|)-approximation of the minimal SLP generating T.

Straight-line programs allow an exponential ratio between the length of the original text and the size of
the SLP. Consider, for example

In complexity analysis we use both log |T| and n (the number of rules in the SLP). For example, we prefer
O(n log |T|) bounds to O(n^2), since in practice the ratio between the length of the text and the size of
the SLP might be much smaller than exponential.
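
To make the definition and the exponential ratio concrete, here is a minimal Python sketch. The dictionary encoding of rules and the names unpack, fib_slp and doubling_slp are our own illustrative assumptions, not notation from the paper. The rule set of fib_slp follows the example of this section as we read it, and the doubling program is a standard illustration of exponential growth, not necessarily the one the authors had in mind.

```python
from typing import Dict, Tuple, Union

# A rule is a single letter, or a pair (p, q) meaning X_i -> X_p X_q with i > p, q.
Rule = Union[str, Tuple[int, int]]

def unpack(slp: Dict[int, Rule]) -> Dict[int, str]:
    """Explicitly generate every intermediate text (naive, may be exponential)."""
    text: Dict[int, str] = {}
    for i in sorted(slp):            # rules only refer to smaller indices
        rule = slp[i]
        text[i] = rule if isinstance(rule, str) else text[rule[0]] + text[rule[1]]
    return text

# Hypothetical encoding of the example SLP X_1 .. X_7.
fib_slp: Dict[int, Rule] = {1: "b", 2: "a", 3: (2, 1), 4: (3, 2),
                            5: (4, 3), 6: (5, 4), 7: (6, 5)}

def doubling_slp(n: int) -> Dict[int, Rule]:
    """X_1 -> a, X_i -> X_{i-1} X_{i-1}: n rules generate a string of length 2^(n-1)."""
    slp: Dict[int, Rule] = {1: "a"}
    for i in range(2, n + 1):
        slp[i] = (i - 1, i - 1)
    return slp

print(unpack(fib_slp)[7])                 # the text generated by X_7
print(len(unpack(doubling_slp(16))[16]))  # 2^15 letters from only 16 rules
```

Unpacking is used here only to inspect small examples; the algorithms of the paper never generate the text.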



III. A New Algorithm for Fully Compressed Pattern Matching 
The decision version of the fully compressed pattern matching problem (FCPM) is as follows:



INPUT: Two straight-line programs generating P and T
OUTPUT: Yes/No (is P a substring of T?)



Other variations are: to find the first occurrence, to count all occurrences, to check whether there is an
occurrence at a given position, and to compute a "compressed" representation of all occurrences. Our
plan is to solve the last one, that is, to compute an auxiliary data structure that contains all the
information necessary for answering the other questions efficiently.

We need some notation and terminology. We call a position in the text a point between two consecutive
letters. Hence, the text a_1 ... a_n has positions 0, ..., n, where the first is in front of the first letter and the
last one is after the last letter. We say that a substring touches a given position if this position is either
inside or on the border of this substring. We use the term occurrence both for the corresponding substring
and for its starting position. Hopefully, the right meaning is always clear from the context.

Let P_1, ..., P_m and T_1, ..., T_n be the nonterminal symbols of the SLPs generating P and T. For each
of these texts we define a special cut position. It is the starting position for one-letter texts and the merging
position for X_i = X_r X_s. In the example above, the cut position of the intermediate text X_6 is between
the 5th and 6th letters: abaab|aba, since X_6 was obtained by concatenating X_5 = abaab and X_4 = aba.



We use a computational assumption which was used in all previous algorithms but not stated explicitly.
In the analysis of our algorithm we count arithmetical operations on text positions as unit operations. In fact,
text positions are integers with at most log |T| bits in binary form. Hence, the bit complexity
is larger than our O(n^2 m) bound by some log |T|-dependent factor.

The explanation of the algorithm goes in three steps. We introduce a special data structure (the AP-table) and
show how to solve the pattern matching problem using this table in Subsection III-A. Then we show how to
compute the AP-table using the Local PM procedure in Subsection III-B. Finally, we present an algorithm for
Local PM in Subsection III-C.



A. Idea of the algorithm 

Our algorithm is based on the following theoretical fact (it was already used in [16]): 

Lemma 1 (Basic lemma): All occurrences of P in T touching any given position form a single
arithmetical progression (ar.pr.).

The AP-table is defined as follows. For every 1 ≤ i ≤ m, 1 ≤ j ≤ n the value AP[i,j] is a code of the
ar.pr. of occurrences of P_i in T_j that touch the cut of T_j. Note that any ar.pr. can be encoded by three
integers: first position, difference, number of elements. If |T_j| < |P_i|, or if T_j is large enough
but there are no occurrences, we define AP[i,j] = ∅.

Claim 1: Using the AP-table (actually, only the top row is necessary) we can solve the decision, counting and
checking versions of FCPM in time O(n).

Claim 2: We can compute the whole AP-table by dynamic programming in time O(n^2 m).
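
The Basic Lemma can be sanity-checked by brute force on small uncompressed strings. The following Python sketch is only an illustration: the helper names are our own, texts are handled explicitly rather than as SLPs, and progressions are encoded as the triple (first position, difference, number of elements) described above.

```python
from typing import List, Optional, Tuple

def occurrences_touching(pattern: str, text: str, pos: int) -> List[int]:
    """Starting positions s of pattern in text such that s <= pos <= s + |pattern|."""
    m = len(pattern)
    lo = max(0, pos - m)
    hi = min(pos, len(text) - m)
    return [s for s in range(lo, hi + 1) if text.startswith(pattern, s)]

def encode_progression(xs: List[int]) -> Optional[Tuple[int, int, int]]:
    """Encode a sorted list as (first, difference, count); None if it is not an ar.pr."""
    if not xs:
        return (0, 0, 0)                      # empty progression
    diff = xs[1] - xs[0] if len(xs) > 1 else 0
    if any(xs[k + 1] - xs[k] != diff for k in range(len(xs) - 1)):
        return None
    return (xs[0], diff, len(xs))

text = "abaababaabaab"
# Check the lemma for every substring of the text used as a pattern and every position.
patterns = {text[a:b] for a in range(len(text)) for b in range(a + 1, len(text) + 1)}
assert all(encode_progression(occurrences_touching(p, text, pos)) is not None
           for p in patterns for pos in range(len(text) + 1))
print(encode_progression(occurrences_touching("aba", text, 5)))  # (3, 2, 2)
```

Here the occurrences of aba touching position 5 start at 3 and 5, which is the kind of triple a cell of the AP-table stores.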



Proof of Claim 1. We get the answer to decision FCPM by the following rule: P occurs in T iff there
is a j such that AP[m,j] is nonempty. Checking and counting are slightly trickier. The recursive algorithm
for checking: test whether the candidate occurrence touches the cut of the current text. If yes, use the AP-table
and check membership in the corresponding ar.pr.; otherwise call this procedure recursively for either
the left or the right part. For counting, we inductively count the number of P-occurrences in all of T_1, ..., T_n.
To start, we just get the cardinality of the corresponding ar.pr. from the AP-table. Inductive step: add the
results for the left part and for the right part, and the cardinality of the central ar.pr. without the
"just-touching" occurrences.
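
The counting recursion of the proof can be sketched as follows. This is a simplified illustration under assumptions of our own: the pattern is given explicitly (uncompressed and nonempty), and instead of reading the central progression from the AP-table we scan by brute force a window of at most |P| - 1 letters on each side of the cut. That window contains exactly the occurrences that strictly cross the cut, since the "just-touching" ones are already counted in the parts. The function name and the SLP encoding are hypothetical.

```python
from typing import Dict, Tuple, Union

Rule = Union[str, Tuple[int, int]]   # a letter, or (l, r) for X_j -> X_l X_r

def count_occurrences(slp: Dict[int, Rule], pattern: str) -> int:
    """Count occurrences of an explicit nonempty pattern in the text of the last rule."""
    m = len(pattern)
    k = m - 1                        # letters kept on each side of a cut
    pre: Dict[int, str] = {}         # first min(k, |T_j|) letters of T_j
    suf: Dict[int, str] = {}         # last  min(k, |T_j|) letters of T_j
    cnt: Dict[int, int] = {}         # number of occurrences inside T_j
    for j in sorted(slp):
        rule = slp[j]
        if isinstance(rule, str):
            pre[j] = suf[j] = rule[:k]
            cnt[j] = 1 if pattern == rule else 0
            continue
        l, r = rule
        pre[j] = (pre[l] + pre[r])[:k]
        joined = suf[l] + suf[r]
        suf[j] = joined[max(0, len(joined) - k):]
        window = suf[l] + pre[r]     # every occurrence here strictly crosses the cut
        crossing = sum(window.startswith(pattern, s)
                       for s in range(len(window) - m + 1))
        cnt[j] = cnt[l] + cnt[r] + crossing
    return cnt[max(slp)]

fib_slp = {1: "b", 2: "a", 3: (2, 1), 4: (3, 2), 5: (4, 3), 6: (5, 4), 7: (6, 5)}
print(count_occurrences(fib_slp, "aba"))   # occurrences of aba in abaababaabaab
```

The sketch runs in time proportional to the number of rules times |P| and never unpacks the text, but it does not handle a compressed pattern; that is what the AP-table is for.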



B. Computing AP-table 



Sketch of the algorithm for computing AP-table. 

1) Precomputation: compute lengths and cut positions for all intermediate texts; 

2) Compute the first row and the first column of AP-table; 

3) From j=2 to n do From i=2 to m do Compute AP[i,j] 

a) Compute occurrences of the larger part of Pi in Tj around the cut of Tj ; 

b) Compute occurrences of the smaller part of Pi in Tj that starts at ending 
positions of the larger part occurrences; 

c) Intersect occurrences of smaller and larger part of Pi and merge all results 
to a single ar.pr. 



At the very beginning we inductively compute tables of lengths, cut positions, first letters and last letters of
the texts P_1, ..., P_m, T_1, ..., T_n in time O(n + m).
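
The precomputation stage can be sketched as a single pass over the rules. The dictionary encoding and the function name precompute are our own assumptions; cut positions follow the convention defined earlier (starting position 0 for one-letter texts, merging position otherwise).

```python
from typing import Dict, Tuple, Union

Rule = Union[str, Tuple[int, int]]   # a letter, or (l, r) for X_i -> X_l X_r

def precompute(slp: Dict[int, Rule]):
    """One pass over the rules: lengths, cut positions, first and last letters."""
    length: Dict[int, int] = {}
    cut: Dict[int, int] = {}
    first: Dict[int, str] = {}
    last: Dict[int, str] = {}
    for i in sorted(slp):                # every rule refers only to smaller indices
        rule = slp[i]
        if isinstance(rule, str):
            length[i], cut[i] = 1, 0     # cut of a one-letter text: its starting position
            first[i] = last[i] = rule
        else:
            l, r = rule
            length[i] = length[l] + length[r]
            cut[i] = length[l]           # merging position of the two parts
            first[i], last[i] = first[l], last[r]
    return length, cut, first, last

fib_slp = {1: "b", 2: "a", 3: (2, 1), 4: (3, 2), 5: (4, 3), 6: (5, 4), 7: (6, 5)}
length, cut, first, last = precompute(fib_slp)
print(length[6], cut[6])                 # X_6 = abaab|aba
```

Each rule is handled in constant time (counting arithmetic on positions as unit operations), which matches the O(n + m) bound when the pass is run on both programs.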

We compute elements of AP-table one-by-one. At the beginning we compute the first row and the first 
column, then in the following order: 

From j=2 to n do From i=2 to m do Compute AP[i,j] 

Case |P_i| = 1: compare with T_j if |T_j| = 1, or otherwise compare with the last letter of the left part and the
first letter of the right part (we get these letters from the precomputation stage). The resulting ar.pr. will be one
or two neighboring positions. Hence, just O(1) time is used for computing every cell of this kind.

Case |T_j| = 1: if |P_i| > 1 return ∅, else compare letters. Again, O(1) time is enough for every element.

Induction step: let P_i and T_j both be of length greater than one. We are going to compute a new element
in time O(n) using the already computed values of the AP-table. For this purpose, we design a special auxiliary
procedure that extracts useful information from the already computed part of the AP-table.

This is the border between the ideas of Miyazaki et al. and the new approach presented in this paper. They
compute the same table, but the new-element routines are completely different.

Procedure LocalPM(i, j, [α, β]) returns the occurrences of P_i in T_j inside the interval [α, β].

Important properties:

• Local PM uses the values AP[i,k] for 1 ≤ k ≤ j,

• It is defined only when |β - α| ≤ 3|P_i|,

• It works in time O(j),

• The output of Local PM is a pair of ar.pr.; all occurrences inside each ar.pr. have a common position,
and all elements of the second are to the right of all elements of the first.

We now show how to compute a new element using 5 Local PM calls. Let P_i = P_r P_s, let the cut position
in T_j be γ (we get it from the precomputation stage) and, without loss of generality, let |P_r| ≥ |P_s|. The
intuitive way is (1) to compute all occurrences of P_r "around" the cut of T_j, (2) to compute all occurrences
of P_s "around" the cut of T_j, and (3) to shift the latter by |P_r| and intersect.

Any occurrence of P_i that touches the cut γ of T_j consists of an occurrence of P_r and an occurrence of P_s
inside the |P_i|-neighborhood of the cut, where the ending position of P_r equals the starting position of P_s.

We apply Local PM to find all occurrences of P_r in the interval [γ - |P_i|, γ + |P_r|]. Its length is
|P_i| + |P_r| ≤ 3|P_r|. As an answer we get two ar.pr. of all potential starts of P_i occurrences that touch
the cut. Unfortunately, we are not able to do the same for P_s, since the length of the interval of interest is
not necessarily bounded by a constant times |P_s|. So we are going to find only those occurrences of P_s that
start at the two arithmetical progressions of endings of P_r occurrences.



We will process each ar.pr. separately. We call an ending continental if it is at least |P_s| far from the
last ending in the progression; otherwise we call it seaside.

Since we have an ar.pr. of P_r occurrences that have a common position (property 4 of Local PM), all
substrings of length |P_s| starting at continental endings are identical. Hence, we need to check all
seaside endings and only one continental position. For checking the seaside region we just apply Local
PM to the |P_s|-neighborhood of the last endpoint and intersect the answer with the ar.pr. of seaside ending
positions. Intersecting two ar.pr. can be done in time O(log |T|) by a technique similar to the extended
Euclidean algorithm. For checking the continental region we apply Local PM to the substring of length |P_s|
starting at the first continental ending.
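
The intersection of two arithmetical progressions mentioned above can be sketched with the extended Euclidean algorithm. This is a minimal sketch, not necessarily the authors' routine: progressions are encoded as (first, difference, count) with positive differences, the empty result is represented by None, and the function names are hypothetical.

```python
from typing import Optional, Tuple

AP = Tuple[int, int, int]            # (first element, difference, number of elements)

def ext_gcd(a: int, b: int) -> Tuple[int, int, int]:
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = ext_gcd(b, a % b)
    return g, y, x - (a // b) * y

def intersect(p: AP, q: AP) -> Optional[AP]:
    """Intersection of two finite progressions, again as a progression (or None)."""
    (a1, d1, c1), (a2, d2, c2) = p, q
    if c1 == 0 or c2 == 0:
        return None
    d1 = d1 if c1 > 1 else 1         # the difference of a singleton is irrelevant
    d2 = d2 if c2 > 1 else 1
    g, x, _ = ext_gcd(d1, d2)
    if (a2 - a1) % g != 0:
        return None                  # residues are incompatible
    lcm = d1 // g * d2
    # Smallest k >= 0 with a1 + d1*k congruent to a2 modulo d2.
    k = ((a2 - a1) // g * x) % (d2 // g)
    z = a1 + d1 * k
    lo = max(a1, a2)
    hi = min(a1 + d1 * (c1 - 1), a2 + d2 * (c2 - 1))
    if z < lo:                       # move up to the first common element >= lo
        z += -(-(lo - z) // lcm) * lcm
    if z > hi:
        return None
    return (z, lcm, (hi - z) // lcm + 1)

print(intersect((0, 3, 10), (2, 5, 10)))   # (12, 15, 2): common elements 12 and 27
```

The number of arithmetic operations is dominated by the Euclidean recursion, so it is logarithmic in the differences, which are themselves bounded by |T|.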

As the answer we obtain all continental endings or none of them, plus some sub-progression of the seaside
endings, plus something similar for the second ar.pr. Since all of these four parts are ar.pr. going one after
another, we can simplify the answer to one ar.pr. (it must be one ar.pr. by the Basic Lemma) in time O(1).

Let us estimate the complexity of the procedure for computing a new element. We use one Local PM
call for P_r, four Local PM calls for P_s, and compute the intersection of arithmetical progressions twice.
Hence, we have 7 steps of O(n) complexity.



C. Realization of Local PM 

In the first step (the crawling procedure) we put all occurrences of P_i in the interval [α, β] of T_j into a list.
In the second step (the merging procedure) we merge all these occurrences into just two ar.pr.

In the crawling procedure on (i, j, [α, β]) we take the ar.pr. of occurrences of P_i in T_j that touch the
cut, keep only the occurrences within the interval, and output this truncated ar.pr. to the list. After that, we
check whether the intersection of the interval [α, β] with the left/right part of T_j is at least |P_i| long. If so,
we recursively call the crawling procedure with the same i, the index of the left/right part of T_j, and this
intersection interval.

Consider the set of all intervals we work with during the crawling procedure. Note that, by construction,
any two of them are either disjoint or nested. Moreover, since the initial interval was at most 3|P_i| long,
there are no four pairwise disjoint intervals in this set. If we consider a sequence of nested intervals,
then each interval corresponds to its own intermediate text from T_1, ..., T_j. Therefore, there are at most
3j recursive calls in the crawling procedure and it works in time O(j).

We assume that the crawling procedure always maintains a pointer to the corresponding place in the list.
This means that at the end we get a sorted list of at most 3n arithmetical progressions. By "sorted" we
mean that the last element of the k-th progression is less than or equal to the first one of the (k+1)-th
progression. It follows from the construction of the crawling procedure that output progressions can have
only first/last elements in common.

Now in the merging procedure we go through the resulting list of progressions. Namely, we compare the 
distance between the last element of current progression and the first element of the next progression with 
the differences of these two progressions. If all three numbers are equal we merge the next progression 
with the current one. Otherwise we just move to the next progression. 
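
The merging pass can be sketched as follows. This is our own minimal reading of the procedure, under stated assumptions: the input list is sorted as described, progressions are triples (first, difference, count), an element shared by two neighboring progressions is dropped from the second one, and a one-element progression is treated as compatible with any difference. The function name is hypothetical.

```python
from typing import List, Tuple

AP = Tuple[int, int, int]            # (first element, difference, number of elements)

def merge_progressions(aps: List[AP]) -> List[AP]:
    """Greedily glue neighboring progressions that continue each other."""
    out: List[AP] = []
    for first, diff, count in aps:
        if count == 0:
            continue
        if out:
            f0, d0, c0 = out[-1]
            last = f0 + d0 * (c0 - 1)
            if first == last:                    # shared endpoint: drop the duplicate
                first, count = first + diff, count - 1
                if count == 0:
                    continue
            gap = first - last
            d_cur = d0 if c0 > 1 else gap        # a singleton fits any difference
            d_next = diff if count > 1 else gap
            if gap == d_cur == d_next:           # all three numbers are equal: merge
                out[-1] = (f0, gap, c0 + count)
                continue
        out.append((first, diff, count))
    return out

print(merge_progressions([(0, 2, 3), (6, 2, 2), (10, 1, 1), (13, 3, 2)]))
# [(0, 2, 6), (13, 3, 2)]
```

The pass touches every progression of the list once, so it is linear in the length of the list.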

If we apply the Basic Lemma to the positions δ_1 = (2α + β)/3 and δ_2 = (α + 2β)/3, we see that all occurrences
of P_i in the interval [α, β] form two (one after another) arithmetical progressions: namely, those that touch
δ_1 and those that do not touch δ_1 but touch δ_2. Here we use that β - α ≤ 3|P_i|, and therefore any occurrence
of P_i touches either δ_1 or δ_2. Hence, our merging procedure will start a new progression at most once.



D. Discussion on the Algorithm 

Here we point out two possible improvements of the algorithm. Consider in detail the "new element
routine". Note that Local PM uses only O(h) time, where h is the height of the SLP generating T, while
the intersection of arithmetical progressions uses even less, O(log |T|). Hence, if it is possible to "balance" any
SLP to O(log |T|) height, then the bound on the running time of our algorithm becomes O(nm log |T|).

It is interesting to consider more rules for generating texts, since collage systems [13] and LZ77 [24]
use concatenations and truncations. Indeed, as Rytter [22] showed, we can keep only concatenations,
expanding the archive just by a factor of O(log |T|). But there is hope that the presented technique will work
directly for a system of truncation/concatenation rules. More details will follow in a full version.

We also claim that the AP-table can be translated into a polynomial-sized SLP generating all occurrences
of P in T.

IV. New Algorithms for Related Problems 

In this section we apply dynamic programming and our AP-table routine to more classical
problems: periods, covers and fingerprinting. For highly compressible texts these algorithms might be faster
than the "unpack-and-solve" approach. These results show that pattern matching is not the only problem that
we can speed up. Hence, we get a more general understanding of how processing of compressed texts
really works. In this extended abstract we present only the basic theoretical facts and short sketches of
the algorithms.



A. Covers and Periods 

A period of a string T is a string W (and also the integer |W|) such that T is a prefix of W^k for some
integer k. A cover (a notion originating from [3]) of a string T is a string C such that any character in T is
covered by some occurrence of C in T. Note that every cover/period is uniquely determined by its length, since
by definition they all are prefixes of T. We use the notation t = |T| in this section.

Problem of compressed periods/covers: given a compressed string T, find the length of the minimal
period/cover and compute a "compressed" representation of all periods/covers.

Theoretical facts serving as a basis for our algorithms:

1) Given an SLP of size n and a substring of the generated text, we can construct a linear-sized SLP (generating
this substring) in linear time.

2) The lengths of the periods of the t-string between t - t/2^k and t - t/2^(k+1) form a single ar.pr.

3) Any cover is a border (a prefix and suffix of the given text). Any given text has a border of length u
iff it has a period of length t - u.

4) If some border in the interval [t/2^(k+1), t/2^k] is a cover, then all smaller borders in that interval are also
covers.

Sketch of the algorithm for compressed periods. Using Fact 3 we will search for borders instead of
periods. We will separately find all borders with lengths in each of the intervals [t/2^(k+1), t/2^k]. Let us fix
some k. Step 1: we take the t/2^(k+1)-prefix of T (let us denote it by P_k) and find all occurrences of this
prefix in the t/2^k-suffix of T. By the Basic Lemma this is a single ar.pr. The distances between the starting
positions of the occurrences of P_k and the end of T are candidate borders.

Step 2: we take the t/2^(k+1)-suffix of T (let us denote it by S_k) and find all occurrences of this suffix in
the t/2^k-prefix of T. By the Basic Lemma this is a single ar.pr. The distances between the start of T and the
ending positions of the occurrences of S_k are the second candidate borders.

Fact: to obtain the set of lengths of all borders in our interval we simply intersect the sets
of candidate and second candidate borders. Let us estimate the complexity of processing one interval.
We apply the FCPM algorithm to patterns that are substrings of the initial text T. By Fact 1 we can efficiently
construct their SLP representations. Moreover, the sizes of the SLPs generating these patterns are larger than
the SLP generating T by at most a constant factor. Step 1 requires a single FCPM call, Step 2 also requires a
single call, and the intersection of two ar.pr. can be done in linear time. Hence, we need O(n^3) time for one
interval and O(n^3 log |T|) time for the whole compressed periods problem.
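
Fact 3 and the notions of border, period and cover can be checked by brute force on small uncompressed strings. The sketch below is only an illustration with hypothetical helper names; it works on explicit strings in quadratic time and is not the compressed algorithm.

```python
from typing import List

def border_lengths(t: str) -> List[int]:
    """Lengths u < |t| such that the u-prefix of t equals its u-suffix."""
    return [u for u in range(len(t)) if t[:u] == t[len(t) - u:]]

def period_lengths(t: str) -> List[int]:
    """Lengths p >= 1 such that t is a prefix of a power of its p-prefix."""
    return [p for p in range(1, len(t) + 1)
            if all(t[i] == t[i + p] for i in range(len(t) - p))]

def is_cover(c: str, t: str) -> bool:
    """True iff every character of t lies inside some occurrence of c."""
    if not c:
        return False
    covered_upto = 0
    for s in range(len(t) - len(c) + 1):
        if t.startswith(c, s):
            if s > covered_upto:         # a gap that no occurrence covers
                return False
            covered_upto = s + len(c)
    return covered_upto == len(t)

t = "abaababaabaab"
borders = border_lengths(t)
# Fact 3: a border of length u corresponds to a period of length |t| - u.
assert sorted(len(t) - u for u in borders) == period_lengths(t)
# Fact 3: every proper cover is a border.
assert all(u in borders for u in range(1, len(t)) if is_cover(t[:u], t))
print(borders, period_lengths(t), is_cover("abaab", t))
```

For this string the borders have lengths 0, 2 and 5, and the border abaab is also a cover.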

The compressed periods problem was introduced in 1996 in the extended abstract [8]. However, the
full version of their algorithm (it works in O(n^5 log^3 |T|) time) was never published.

Sketch of the algorithm for compressed covers. We will separately solve the problem for cover sizes between
t/2^(k+1) and t/2^k for every k. As was shown in the previous algorithm, we can find all borders with lengths
in one such interval in time O(n^3). Then, using Fact 4, we apply binary search with the cover-check procedure.

Cover-check procedure (for given compressed strings C and T, check whether C is a cover of T):

1) Compute the AP-table for C and T.

2) For every intermediate text T_j check by the Local PM procedure whether the whole |C|-neighborhood of the
cut in T_j, except for the characters that are less than |C|-close to the ends of T_j, is covered by
C-occurrences.

3) Check by the Local PM procedure whether the |C|-prefix and the |C|-suffix of T are completely covered by
C-occurrences.

We can prove by induction that C is a cover iff we get only "yes" answers in Step 2 and Step 3.

Complexity analysis. Finding the borders in all intervals requires O(n^3 log |T|) time. The cover-check
procedure requires O(n^3) time, since Step 2 and Step 3 use only O(n^2) time. We call the cover-check procedure
at most log |T| times in each of the log |T| intervals. Hence, the total complexity is O(n^3 log^2 |T|).

B. Fingerprint table 

Let Σ be an alphabet. A fingerprint is the set of characters used in some substring of a text T. A fingerprint
table is the set of all fingerprints for a given text. An algorithm for computing the fingerprint table of usual
(uncompressed) strings is presented in [1]. Important fact: for any string there are at most |Σ| + 1
(including the empty one) different fingerprints over all its suffixes (prefixes).
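
The definitions and the |Σ| + 1 bound can be illustrated on explicit strings. The following sketch uses hypothetical function names and a brute-force quadratic scan; it is neither the algorithm of [1] nor the compressed algorithm, and the table it builds lists fingerprints of nonempty substrings only.

```python
from typing import FrozenSet, List, Set

def prefix_fingerprints(text: str) -> List[FrozenSet[str]]:
    """Distinct fingerprints of all prefixes, starting with the empty one."""
    seen: Set[str] = set()
    result = [frozenset()]
    for ch in text:
        if ch not in seen:           # the fingerprint changes only on a new letter
            seen.add(ch)
            result.append(frozenset(seen))
    return result

def fingerprint_table(text: str) -> Set[FrozenSet[str]]:
    """Fingerprints of all nonempty substrings, by brute force in O(|T|^2) steps."""
    table: Set[FrozenSet[str]] = set()
    for i in range(len(text)):
        seen: Set[str] = set()
        for j in range(i, len(text)):
            seen.add(text[j])
            table.add(frozenset(seen))
    return table

text = "abacabad"
# At most |alphabet| + 1 prefix fingerprints, the empty one included.
assert len(prefix_fingerprints(text)) <= len(set(text)) + 1
print(len(fingerprint_table(text)))
```

Prefix fingerprints form a growing chain of sets, which is why there are so few of them; the same holds for suffixes by symmetry.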

Compressed fingerprint table: given a compressed string T, compute its fingerprint table.

Sketch of the algorithm. For every j compute all prefix fingerprints and all suffix fingerprints of T_j, and add
to the table all fingerprints of "cut-containing" substrings of T_j by induction. At the end we clean the table
of repeated fingerprints. Hence, we have O(n|Σ|^2 log n) complexity and O(n|Σ|^2) output size.

C. Hardness result 

Compressed Hamming distance (inequality version): given an integer h and SLPs generating T_1 and T_2,
check whether the Hamming distance (the number of positions at which the characters differ) between them is
less than h.

Theorem 1: Compressed Hamming distance is NP- and coNP-hard (even the inequality version).
Proof: Recall the well-known NP-complete Subset Sum problem [7]: given integers w_1, ..., w_n, t
in binary form, determine whether there exist x_1, ..., x_n ∈ {0, 1} such that sum_{i=1}^{n} x_i w_i = t.

Let us fix some input values for Subset Sum. We now (efficiently) construct P, T and h such that
HD(P, T) < h iff Subset Sum has a positive answer.



Let s = w_1 + ... + w_n. Then we take

P = (a^t b a^(s-t))^(2^n),    T = prod_{x=0}^{2^n - 1} (a^(x·w) b a^(s - x·w)).

Here x·w = sum_i x_i w_i, where x_1, ..., x_n are the binary digits of x, and prod denotes concatenation. The
string T (we call it the Lohrey string) was presented for the first time in Lohrey's work [15] and then also
used in [14]. It was proved in [15] that, given the input data of Subset Sum, we can construct polynomial-size
SLPs generating P and T in polynomial time. But the inequality HD(P, T) < 2^(n+1) holds iff the Subset Sum
answer is "yes". Hence, the compressed Hamming distance is NP-hard. Inverting all bits in P and taking
h = |P| - 2^(n+1) + 1 we can prove coNP-hardness. ■
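
The counting argument behind the reduction can be tried out on tiny Subset Sum instances by building the two strings explicitly. This sketch assumes the construction P = (a^t b a^(s-t))^(2^n) and T equal to the concatenation, over all x in {0,1}^n, of a^(x·w) b a^(s - x·w), with 0 ≤ t ≤ s. The strings are exponentially long, so this only checks the inequality HD(P, T) < 2^(n+1); it is not the polynomial-size SLP construction of [15]. All function names are hypothetical.

```python
from itertools import product
from typing import List, Tuple

def lohrey_pair(w: List[int], t: int) -> Tuple[str, str]:
    """Build P and T explicitly (exponential size; for tiny inputs only)."""
    s, n = sum(w), len(w)
    assert 0 <= t <= s
    p = ("a" * t + "b" + "a" * (s - t)) * (2 ** n)
    blocks = []
    for x in product((0, 1), repeat=n):      # all 2^n choices of x_1 .. x_n
        xw = sum(xi * wi for xi, wi in zip(x, w))
        blocks.append("a" * xw + "b" + "a" * (s - xw))
    return p, "".join(blocks)

def hamming(u: str, v: str) -> int:
    assert len(u) == len(v)
    return sum(cu != cv for cu, cv in zip(u, v))

def subset_sum_via_hamming(w: List[int], t: int) -> bool:
    """A block contributes 0 mismatches if x.w == t and exactly 2 otherwise."""
    p, text = lohrey_pair(w, t)
    return hamming(p, text) < 2 ** (len(w) + 1)

print(subset_sum_via_hamming([3, 5, 7], 8))   # 3 + 5 = 8
print(subset_sum_via_hamming([3, 5, 7], 4))   # no subset sums to 4
```

Every block in which the subset sum misses t shifts the letter b and therefore costs two mismatches, so the distance drops below 2^(n+1) exactly when at least one subset hits t.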

V. Open Problems and Directions for Further Research 
Algorithms. 

• To speed up the presented algorithm for fully compressed pattern matching. Conjecture:
O(nm log |T|) time is enough. More precisely, to show that computing a new element of the AP-table
can be done in time O(log |T|).

• To construct a faster algorithm, and/or one with smaller memory requirements, for the compressed window
subsequence problem. This problem was solved in [5] in O(nk^2) time and space. Conjecture:
O(nk^2) time, but O(nk) space.

• To construct O(nm) algorithms for (weighted) edit distance, where n is the original length of T_1 and m
is the size of the SLP generating T_2. This would lead to an immediate speed-up from any "super-logarithmic"
compression, since only an O(n^2 / log n) classical algorithm is known for this problem.

• To speed up the presented algorithm for the compressed fingerprint table, or to find a text generated by an
SLP of size n with an Ω(n|Σ|^2) fingerprint table.

Complexity. 

• Membership of a compressed string in a language described by an extended regular expression is NP-hard.
On the other hand, it is in PSPACE. To find the exact complexity of the problem (repeated from
[21]).

• The fully compressed subsequence problem is Θ_2-hard [14]. On the other hand, it is in PSPACE. To
find the exact complexity of the problem.

• The compressed Hamming distance is NP- and coNP-hard. On the other hand, it is in PSPACE. To
find the exact complexity of the problem.

• Is compressed weighted edit distance NP-hard for any system of weights? What is the exact complexity
of the problem?

Other directions for further research include considering more classical string problems, more powerful 
models of string generation and experimental study of new algorithms on compressed texts. Also it is 
extremely interesting to find effective techniques besides dynamic programming. 